Automatic batching/instancing of draw commands #9685
Conversation
I think this is ready for an initial review pass. If people are on board with this, then I think the batching can probably be made more generic and reusable, and then systems handling batching for both core and custom phases can be added more easily. Perhaps there is also a need to be able to opt out of automatic batching if that's necessary for people to be able to do completely custom stuff. I think if the entity would just not match the object query, then the prepare systems would skip those entities/phase items.
I just noticed that this breaks …
So the reason …
I'll probably add some batching tests when I can.
These examples don't appear to be working properly. The rest seem OK. (29b8a29 was tested, all examples, comparing screenshots)
Right, many_foxes should be fixed now. I'll check the other two. Thanks! EDIT: Fixed.
crates/bevy_pbr/src/render/mesh.rs (outdated)
```rust
struct BatchState<'mat, 'mesh> {
    meta: BatchMeta<'mat, 'mesh>,
    /// The base index in the object data binding's array
    gpu_array_buffer_index: GpuArrayBufferIndex<MeshUniform>,
```
We may want to consider renaming MeshUniform to ObjectUniform or something.
Yeah, I want to call it PerObjectData or PerInstanceData. Or that without Per. ObjectUniform is also a decent suggestion. I’ve avoided ‘uniform’ because it feels like it implies the data is stored in a uniform buffer and it may not be.
I like PerInstanceData imo.
Me too. I did have a concern about using the term instance and its clash with instanced rendering. But, I'm coming to think that it isn't a clash. We are using instance indices to look up the per-instance data. That they aren't in an instance-rate vertex buffer (note that being instance-rate is a qualifier to indicate that the data in that vertex buffer is per-instance and is stepped as instances are stepped) doesn't matter.
@cart any opinion on this renaming? Also, what about 2D vs 3D meshes? Per2dInstanceData? PerInstance2dData? PerInstanceData2d? I think I prefer Per2dInstanceData as an initial reaction.
first pass - i haven't done any testing yet, just a read through. the perf benefits are looking awesome though.
I revisited my notes in #89 (comment) and I see that when writing that, I didn't consider significant batching for CPU-driven rendering. I noted two things:
I haven't implemented this. Currently, the only way for a Material to have a dynamic offset as part of its binding is if a custom material contains a dynamic offset uniform/storage buffer as one of the material members. I feel like we could live with this and fix it in a separate PR if/when someone notices it is missing. Otherwise, there are more gains to be had, benefiting more people, from focusing on other efforts.
This approach has carried over. The same per-object data would be written multiple times to the … I think the best we will be able to do for CPU-driven rendering will practically be reducing bindings so much that encoding opaque draws is occasionally swapping material bindings and mostly just drawing instances of index/vertex ranges and nothing else. The rest will have to be GPU-driven.
materials can't currently use dynamic offsets as …
it's still an improvement over the current status quo where a binding+draw is issued per mesh per view.
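To make the comparison-based batching idea concrete, here is a minimal sketch of the core loop (not the actual `batch_and_prepare_render_phase` code): walk the sorted phase items and only emit a draw when some piece of draw state changes from the previous item. The `DrawState` fields and the `emit_draw` callback are hypothetical placeholders for the real comparison data (pipeline id, bind groups and their dynamic offsets, index/vertex buffer ids).

```rust
use std::ops::Range;

/// Hypothetical stand-in for the state compared between consecutive phase items.
/// In the real system this covers the pipeline, bind groups and their dynamic
/// offsets, and the index/vertex buffers.
#[derive(PartialEq)]
struct DrawState {
    pipeline: u32,
    material_bind_group: u32,
    mesh_buffers: u32,
}

/// Greedily merge consecutive items with identical draw state into one instanced draw.
/// `items` is assumed to already be in sorted render phase order.
fn batch_draws(items: &[DrawState], mut emit_draw: impl FnMut(&DrawState, Range<u32>)) {
    let mut batch_start = 0usize;
    for i in 1..=items.len() {
        // Flush when we reach the end or the draw state differs from the current batch.
        if i == items.len() || items[i] != items[batch_start] {
            emit_draw(&items[batch_start], batch_start as u32..i as u32);
            batch_start = i;
        }
    }
}
```

When everything is identical (the `many_cubes --benchmark` case below) this collapses ~11.7k items into a single draw; when the data is unique per instance it degenerates to one draw per item, which is why those cases mostly just need to avoid regressions rather than gain.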
Love how straightforward this is. Very uncontroversial and clear. Just a few comments.
```rust
if let Some(gpu_mesh) = meshes.into_inner().get(&mesh_handle.0) {
    pass.set_vertex_buffer(0, gpu_mesh.vertex_buffer.slice(..));

    #[cfg(all(feature = "webgl", target_arch = "wasm32"))]
    pass.set_push_constants(
```
What is the story here for webgl? I thought webgl didn't support push constants (and I'm guessing wgpu is emulating them somehow with implicit uniforms or something?). What purpose does this serve?
Could use a comment explaining this
This is a copy-pasted workaround for how the shader's instance index built-in is handled in the GL backend. WGSL defines the instance index as the base instance index plus the instance being drawn in this sequence, so basically the range we pass to the draw command. But GL doesn't: it is just the instance within the sequence. As such, we work around it by passing the base instance as a push constant, which is implemented as a glUniform with no buffer. This was the approach I found to work around the shortcoming of the GL backend and its deviation from the semantics of what instance index in WGSL should mean. This should really be fixed in wgpu.
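For reference, a minimal sketch of that workaround (not the exact Bevy code; the `webgl` feature flag, the helper name, and the exact `wgpu::RenderPass` signature for your wgpu version are assumptions here):

```rust
use std::ops::Range;

/// Encode one batched draw. On the GL backend, `@builtin(instance_index)` in the shader
/// does not include the base instance of the draw, so the batch start is passed as a
/// push constant (lowered by wgpu's GL backend to a plain glUniform with no buffer) and
/// added back onto the instance index in the vertex shader.
fn draw_batch(
    pass: &mut wgpu::RenderPass<'_>,
    indices: Range<u32>,
    batch_range: Range<u32>,
) {
    #[cfg(all(feature = "webgl", target_arch = "wasm32"))]
    pass.set_push_constants(
        wgpu::ShaderStages::VERTEX,
        0,
        &(batch_range.start as i32).to_le_bytes(),
    );

    // WGSL side (sketch): declare `var<push_constant> base_instance: i32;` and use
    // `u32(base_instance) + instance_index` when looking up per-object data on GL.
    pass.draw_indexed(indices, 0, batch_range);
}
```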
```diff
-values.push((entity, MorphIndex { index }));
+// NOTE: Because morph targets require per-morph target texture bindings, they cannot
+// currently be batched.
+values.push((entity, (MorphIndex { index }, NoAutomaticBatching)));
```
Seems like it would be safer to make "no automatic batching" the default and then opt in to automatic batching via a marker type. Is it done in reverse for perf reasons?
You could say that. I want people to get better performance by default. Why not batch if you can? And then if stuff breaks, opt out or fix the things needed to get batching working.
I think "automatic batching" should definitely apply to user facing types for things like PbrBundle, SpriteBundle, etc. But this is a RenderApp component. If someone is building a new rendered entity type with custom rendering logic, idk if we should batch that by default if there are "hidden" data access constraints that we don't check or enforce. Provided adding a AutomaticBatching component doesn't cost us anything significant for things in the hot path, I think it makes sense to opt-in things like sprites, 2d + 3d meshes, etc.
Interestingly, now that …
@rparrett yeah. That's because it practically has to re-bind the material between every draw as they're no longer in groups. This illustrates the cost of rebinding to update a dynamic offset between draws.
# Objective

- Implement the foundations of automatic batching/instancing of draw commands as the next step from bevyengine#89
- NOTE: More performance improvements will come when more data is managed and bound in ways that do not require rebinding such as mesh, material, and texture data.

## Solution

- The core idea for batching of draw commands is to check whether any of the information that has to be passed when encoding a draw command changes between two things that are being drawn according to the sorted render phase order. These should be things like the pipeline, bind groups and their dynamic offsets, index/vertex buffers, and so on.
- The following assumptions have been made:
  - Only entities with prepared assets (pipelines, materials, meshes) are queued to phases
  - View bindings are constant across a phase for a given draw function as phases are per-view
  - `batch_and_prepare_render_phase` is the only system that performs this batching and has sole responsibility for preparing the per-object data. As such the mesh binding and dynamic offsets are assumed to only vary as a result of the `batch_and_prepare_render_phase` system, e.g. due to having to split data across separate uniform bindings within the same buffer due to the maximum uniform buffer binding size.
- Implement `GpuArrayBuffer` for `Mesh2dUniform` to store `Mesh2dUniform` in arrays in GPU buffers rather than each one being at a dynamic offset in a uniform buffer. This is the same optimisation that was made for 3D not long ago.
- Change batch size for a range in `PhaseItem`, adding API for getting or mutating the range. This is more flexible than a size as the length of the range can be used in place of the size, but the start and end can be otherwise whatever is needed.
- Add an optional mesh bind group dynamic offset to `PhaseItem`. This avoids having to do a massive table move just to insert `GpuArrayBufferIndex` components.

## Benchmarks

All tests have been run on an M1 Max on AC power. `bevymark` and `many_cubes` were modified to use 1920x1080 with a scale factor of 1. I run a script that runs a separate Tracy capture process, and then runs the bevy example with `--features bevy_ci_testing,trace_tracy` and `CI_TESTING_CONFIG=../benchmark.ron` with the contents of `../benchmark.ron`:

```rust
(
    exit_after: Some(1500)
)
```

...in order to run each test for 1500 frames.

The recent changes to `many_cubes` and `bevymark` added reproducible random number generation so that with the same settings, the same rng will occur. They also added benchmark modes that use a fixed delta time for animations. Combined, this means that the same frames should be rendered both on main and on the branch.

The graphs compare main (yellow) to this PR (red).

### 3D Mesh `many_cubes --benchmark`

<img width="1411" alt="Screenshot 2023-09-03 at 23 42 10" src="https://github.com/bevyengine/bevy/assets/302146/2088716a-c918-486c-8129-090b26fd2bc4">

The mesh and material are the same for all instances. This is basically the best case for the initial batching implementation as it results in 1 draw for the ~11.7k visible meshes. It gives a ~30% reduction in median frame time.
The 1000th frame is identical using the flip tool:

![flip many_cubes-main-mesh3d many_cubes-batching-mesh3d 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/2511f37a-6df8-481a-932f-706ca4de7643)

```
Mean: 0.000000
Weighted median: 0.000000
1st weighted quartile: 0.000000
3rd weighted quartile: 0.000000
Min: 0.000000
Max: 0.000000
Evaluation time: 0.4615 seconds
```

### 3D Mesh `many_cubes --benchmark --material-texture-count 10`

<img width="1404" alt="Screenshot 2023-09-03 at 23 45 18" src="https://github.com/bevyengine/bevy/assets/302146/5ee9c447-5bd2-45c6-9706-ac5ff8916daf">

This run uses 10 different materials by varying their textures. The materials are randomly selected, and there is no sorting by material bind group for opaque 3D so any batching is 'random'. The PR produces a ~5% reduction in median frame time. If we were to sort the opaque phase by the material bind group, then this should be a lot faster. This produces about 10.5k draws for the 11.7k visible entities. This makes sense as randomly selecting from 10 materials gives a chance that two adjacent entities randomly select the same material and can be batched.

The 1000th frame is identical in flip:

![flip many_cubes-main-mesh3d-mtc10 many_cubes-batching-mesh3d-mtc10 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/2b3a8614-9466-4ed8-b50c-d4aa71615dbb)

```
Mean: 0.000000
Weighted median: 0.000000
1st weighted quartile: 0.000000
3rd weighted quartile: 0.000000
Min: 0.000000
Max: 0.000000
Evaluation time: 0.4537 seconds
```

### 3D Mesh `many_cubes --benchmark --vary-per-instance`

<img width="1394" alt="Screenshot 2023-09-03 at 23 48 44" src="https://github.com/bevyengine/bevy/assets/302146/f02a816b-a444-4c18-a96a-63b5436f3b7f">

This run varies the material data per instance by randomly-generating its colour. This is the worst case for batching and that it performs about the same as `main` is a good thing as it demonstrates that the batching has minimal overhead when dealing with ~11k visible mesh entities.

The 1000th frame is identical according to flip:

![flip many_cubes-main-mesh3d-vpi many_cubes-batching-mesh3d-vpi 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/ac5f5c14-9bda-4d1a-8219-7577d4aac68c)

```
Mean: 0.000000
Weighted median: 0.000000
1st weighted quartile: 0.000000
3rd weighted quartile: 0.000000
Min: 0.000000
Max: 0.000000
Evaluation time: 0.4568 seconds
```

### 2D Mesh `bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d`

<img width="1412" alt="Screenshot 2023-09-03 at 23 59 56" src="https://github.com/bevyengine/bevy/assets/302146/cb02ae07-237b-4646-ae9f-fda4dafcbad4">

This spawns 160 waves of 1000 quad meshes that are shaded with ColorMaterial. Each wave has a different material so 160 waves currently should result in 160 batches. This results in a 50% reduction in median frame time.

Capturing a screenshot of the 1000th frame main vs PR gives:

![flip bevymark-main-mesh2d bevymark-batching-mesh2d 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/80102728-1217-4059-87af-14d05044df40)

```
Mean: 0.001222
Weighted median: 0.750432
1st weighted quartile: 0.453494
3rd weighted quartile: 0.969758
Min: 0.000000
Max: 0.990296
Evaluation time: 0.4255 seconds
```

So they seem to produce the same results. I also double-checked the number of draws. `main` does 160000 draws, and the PR does 160, as expected.
### 2D Mesh `bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d --material-texture-count 10`

<img width="1392" alt="Screenshot 2023-09-04 at 00 09 22" src="https://github.com/bevyengine/bevy/assets/302146/4358da2e-ce32-4134-82df-3ab74c40849c">

This generates 10 textures and generates materials for each of those and then selects one material per wave. The median frame time is reduced by 50%. Similar to the plain run above, this produces 160 draws on the PR and 160000 on `main` and the 1000th frame is identical (ignoring the fps counter text overlay).

![flip bevymark-main-mesh2d-mtc10 bevymark-batching-mesh2d-mtc10 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/ebed2822-dce7-426a-858b-b77dc45b986f)

```
Mean: 0.002877
Weighted median: 0.964980
1st weighted quartile: 0.668871
3rd weighted quartile: 0.982749
Min: 0.000000
Max: 0.992377
Evaluation time: 0.4301 seconds
```

### 2D Mesh `bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d --vary-per-instance`

<img width="1396" alt="Screenshot 2023-09-04 at 00 13 53" src="https://github.com/bevyengine/bevy/assets/302146/b2198b18-3439-47ad-919a-cdabe190facb">

This creates unique materials per instance by randomly-generating the material's colour. This is the worst case for 2D batching. Somehow, this PR manages a 7% reduction in median frame time. Both main and this PR issue 160000 draws.

The 1000th frame is the same:

![flip bevymark-main-mesh2d-vpi bevymark-batching-mesh2d-vpi 67ppd ldr](https://github.com/bevyengine/bevy/assets/302146/a2ec471c-f576-4a36-a23b-b24b22578b97)

```
Mean: 0.001214
Weighted median: 0.937499
1st weighted quartile: 0.635467
3rd weighted quartile: 0.979085
Min: 0.000000
Max: 0.988971
Evaluation time: 0.4462 seconds
```

### 2D Sprite `bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite`

<img width="1396" alt="Screenshot 2023-09-04 at 12 21 12" src="https://github.com/bevyengine/bevy/assets/302146/8b31e915-d6be-4cac-abf5-c6a4da9c3d43">

This just spawns 160 waves of 1000 sprites. There should be and is no notable difference between main and the PR.

### 2D Sprite `bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite --material-texture-count 10`

<img width="1389" alt="Screenshot 2023-09-04 at 12 36 08" src="https://github.com/bevyengine/bevy/assets/302146/45fe8d6d-c901-4062-a349-3693dd044413">

This spawns the sprites selecting a texture at random per instance from the 10 generated textures. This has no significant change vs main and shouldn't.

### 2D Sprite `bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite --vary-per-instance`

<img width="1401" alt="Screenshot 2023-09-04 at 12 29 52" src="https://github.com/bevyengine/bevy/assets/302146/762c5c60-352e-471f-8dbe-bbf10e24ebd6">

This sets the sprite colour as being unique per instance. This can still all be drawn using one batch. There should be no difference but the PR produces median frame times that are 4% higher. Investigation showed no clear sources of cost, rather a mix of give and take that should not happen. It seems like noise in the results.
### Summary

| Benchmark | % change in median frame time |
| ------------- | ------------- |
| many_cubes | 🟩 -30% |
| many_cubes 10 materials | 🟩 -5% |
| many_cubes unique materials | 🟩 ~0% |
| bevymark mesh2d | 🟩 -50% |
| bevymark mesh2d 10 materials | 🟩 -50% |
| bevymark mesh2d unique materials | 🟩 -7% |
| bevymark sprite | 🟥 2% |
| bevymark sprite 10 materials | 🟥 0.6% |
| bevymark sprite unique materials | 🟥 4.1% |

---

## Changelog

- Added: 2D and 3D mesh entities that share the same mesh and material (same textures, same data) are now batched into the same draw command for better performance.

---------

Co-authored-by: robtfm <50659922+robtfm@users.noreply.github.com>
Co-authored-by: Nicola Papale <nico@nicopap.ch>
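To illustrate the `PhaseItem` changes described in the Solution section above, here is a minimal sketch of a phase item carrying a batch range and an optional dynamic offset. This is not the real Bevy type; the field and method names are illustrative, and real items also carry sort keys, pipeline ids, and draw function ids.

```rust
use std::ops::Range;

/// Illustrative stand-in for a sortable phase item after this PR: instead of a
/// `batch_size: u32`, the item stores the range of instance indices its draw covers,
/// plus an optional dynamic offset for the mesh bind group.
#[allow(dead_code)]
struct ExamplePhaseItem {
    entity: u64,                 // stand-in for `Entity`
    draw_function: usize,        // stand-in for `DrawFunctionId`
    batch_range: Range<u32>,     // instances covered by this item's draw
    dynamic_offset: Option<u32>, // mesh bind group dynamic offset, if any
}

#[allow(dead_code)]
impl ExamplePhaseItem {
    fn batch_range(&self) -> &Range<u32> {
        &self.batch_range
    }

    /// Batching extends the range, e.g. `item.batch_range_mut().end += 1` to absorb
    /// the next compatible item; the range length plays the role of the old batch size.
    fn batch_range_mut(&mut self) -> &mut Range<u32> {
        &mut self.batch_range
    }
}
```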
# Objective

- Since #9685, Bevy has automatic batching of draw commands.
- `batch_and_prepare_render_phase` takes responsibility for batching each `PhaseItem`.
- The `GetBatchData` trait identifies how each phase item should be batched. It defines an associated type `Data` used as a query to fetch data from the world.
- However, the implementations of `GetBatchData` in Bevy always set `type Data = Entity`, so we effectively end up with `let entity: Entity = query.get(item.entity())`, which causes unnecessary overhead.

## Solution

- Remove the associated types `Data` and `Filter` from `GetBatchData`.
- Change the type of the `query_item` parameter in `get_batch_data` from `Self::Data` to `Entity`.
- `batch_and_prepare_render_phase` no longer takes a query using `F::Data, F::Filter`.
- `get_batch_data` now returns `Option<(Self::BufferData, Option<Self::CompareData>)>`.

---

## Performance

Based on main merged with #11290. Windows 11, Intel 13400kf, NV 4070Ti.

![image](https://github.com/bevyengine/bevy/assets/45868716/f63b9d98-6aee-4057-a2c7-a2162b2db765)

Frame time went from 3.34 ms to 3 ms, ~10%.

![image](https://github.com/bevyengine/bevy/assets/45868716/a06eea9c-f79e-4324-8392-8d321560c5ba)

`batch_and_prepare_render_phase` went from ~800 us to ~400 us.

## Migration Guide

- The trait `GetBatchData` no longer holds the associated types `Data` and `Filter`.
- The `get_batch_data` `query_item` parameter changed from `Self::Data` to `Entity`, and the method now returns `Option<(Self::BufferData, Option<Self::CompareData>)>`.
- `batch_and_prepare_render_phase` should no longer take a query.
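Based on the migration guide above, the resulting trait has roughly the following shape. This is a simplified sketch with local stand-in types; the real trait in `bevy_render` uses system-param machinery in place of the plain world reference here and has additional bounds on the buffer data.

```rust
/// Local stand-ins so the sketch is self-contained; these are not the Bevy types.
#[allow(dead_code)]
pub struct Entity(u64);
pub struct RenderWorld;

/// Sketch of `GetBatchData` after this change: the `Data`/`Filter` associated types are
/// gone, `get_batch_data` receives the `Entity` directly, and it returns the per-instance
/// buffer data plus optional data used to decide whether two items may share a batch.
pub trait GetBatchData {
    type BufferData;
    type CompareData: PartialEq;

    fn get_batch_data(
        world: &RenderWorld,
        entity: Entity,
    ) -> Option<(Self::BufferData, Option<Self::CompareData>)>;
}
```

A batching implementation can then compare `CompareData` between consecutive phase items and write one `BufferData` entry per instance into the GPU array buffer.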